Quick Start

Prerequisites

Before starting, you will need Hadoop and sbt (version 0.12.0).

Scoobi uses sbt to simplify building and packaging a project for running on Hadoop. We also provide an sbt plugin, sbt-scoobi, that lets you create a self-contained JAR for Hadoop.

Directory Structure

Here are the steps to get started on your own project:

$ mkdir my-app
$ cd my-app
$ mkdir -p src/main/scala

First, create a build.sbt file that declares a dependency on Scoobi:

name := "MyApp"

version := "0.1"

scalaVersion := "2.9.2"

libraryDependencies += "com.nicta" %% "scoobi" % "0.5.0-cdh4"

scalacOptions ++= Seq("-Ydependent-method-types", "-deprecation")

resolvers += "Sonatype-snapshots" at "http://oss.sonatype.org/content/repositories/snapshots"

Write your code

Now we can write some code. In src/main/scala/myfile.scala, for instance:

package mypackage.myapp

import com.nicta.scoobi.Scoobi._

object WordCount extends ScoobiApp {
  def run() {
    val lines = fromTextFile(args(0))

    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .groupByKey
                      .combine((a: Int, b: Int) => a + b)
    
    persist(toTextFile(counts, args(1)))
  }
}
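
To see what the pipeline above computes without a Hadoop cluster, the same steps can be sketched with ordinary Scala collections. This is only an illustration, not Scoobi code: the names LocalWordCount and wordCount are hypothetical, and the standard library's groupBy followed by a per-key reduce stands in for groupByKey and combine.

```scala
// Plain Scala collections sketch (no Hadoop or Scoobi required).
// LocalWordCount and wordCount are hypothetical names used for illustration.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // gather the pairs by word (stands in for groupByKey)
      .map { case (word, pairs) =>
        // sum the counts for each word (stands in for combine)
        (word, pairs.map(_._2).reduce(_ + _))
      }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello world", "hello scoobi"))
    // print one "word<TAB>count" line per word, sorted by word
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(word + "\t" + n) }
  }
}
```

For the two sample lines, "hello" is counted twice and "world" and "scoobi" once each. The Scoobi version expresses the same logic, but over a distributed collection read from and written to files.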

Running

The Scoobi application can now be compiled and run using sbt:

$ sbt compile
$ sbt "run-main mypackage.myapp.WordCount input-files output"

Your Hadoop configuration will automatically get picked up, and all relevant JARs will be made available.

If you had any trouble following along, take a look at Word Count for a self-contained example.